import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler
Assigning scores to establishments was not an easy task. The approach we found most reasonable is the following:
To compute the critical score, we used the features constructed in the FeatureForSafetyScore.ipynb notebook. The idea in that notebook was to have features which lie between 0 and 1 and which reflect the inspection outcomes. Each computed feature is explained in that notebook.
For the critical score, we examined each feature and, after deliberation, assigned it a weight; since the weights sum to 1 and every feature lies between 0 and 1, the overall score also lies between 0 and 1:
These weights may seem arbitrary, but we arrived at them through reasoning, and we will validate them later in this notebook.
def score(violation_per_inspection, critical_violations_per_inspection, moderate_violations_per_inspection,
          critical_violation_ratio, non_critical_violation_per_inspection, poisoning_ratio, allergen_ratio):
    """
    Compute the overall critical score as a weighted sum of the features
    built in FeatureForSafetyScore.ipynb (each feature lies in [0, 1]).
    Arguments:
    violation_per_inspection - weight 0.15
    critical_violations_per_inspection - weight 0.2
    moderate_violations_per_inspection - weight 0.1
    critical_violation_ratio - weight 0.1
    non_critical_violation_per_inspection - weight 0.05
    poisoning_ratio - weight 0.2
    allergen_ratio - weight 0.2
    """
    return 0.15 * violation_per_inspection + \
           0.2 * critical_violations_per_inspection + \
           0.1 * moderate_violations_per_inspection + \
           0.1 * critical_violation_ratio + \
           0.05 * non_critical_violation_per_inspection + \
           0.2 * poisoning_ratio + \
           0.2 * allergen_ratio
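Since each feature lies in [0, 1], the score is bounded above by the sum of the weights. A quick sanity check (a minimal sketch, independent of the notebook's data) confirms that the weights sum to 1:

```python
# Weights used in score(), in argument order
weights = [0.15, 0.2, 0.1, 0.1, 0.05, 0.2, 0.2]

# With every feature in [0, 1], the score is bounded by sum(weights);
# round away floating-point noise before displaying
print(round(sum(weights), 10))  # 1.0
```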
scores = pd.read_pickle('./pickles/features_for_score')
scores['Critics Score'] = 1 - score(scores['Violation per Inspection'],
scores['Critical Violation per Inspection'],
scores['Moderate Violation per Inspection'],
scores['Critical Violations Ratio'],
scores['Non-Critical Violation per Inspection'],
scores['Yes Ratio of VomitDiarrheal'],
scores['Yes Ratio of Allergen'])
scores_cluster = scores[['Violation per Inspection', 'Critical Violation per Inspection','Moderate Violation per Inspection',
'Critical Violations Ratio', 'Non-Critical Violation per Inspection', 'Yes Ratio of VomitDiarrheal',
'Yes Ratio of Allergen']]
scores_cluster
scores['Critics Score'].describe()
plt.hist(scores['Critics Score'])
We can now normalize the scores so that they lie between 0 and 1 (the weighted score already falls in this range, so this step is optional and left commented out).
# import numpy as np
# scaler = MinMaxScaler()
# new_scores = scaler.fit_transform(scores['Critics Score'].values.reshape((-1,1)))
# scores['Critics Score'] = new_scores.reshape(1,-1)[0]
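If rescaling were ever needed (for example, after changing the weights so they no longer sum to 1), the commented-out step above amounts to a min-max rescale. A minimal sketch, with toy values standing in for scores['Critics Score']:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values standing in for scores['Critics Score']
raw = np.array([0.32, 0.55, 0.71, 0.48, 0.90])

# MinMaxScaler expects a 2-D array, hence the reshape;
# the result spans exactly [0, 1]
rescaled = MinMaxScaler().fit_transform(raw.reshape(-1, 1)).ravel()
print(rescaled.min(), rescaled.max())  # 0.0 1.0
```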
To validate the scores, we perform clustering after dimensionality reduction.
We first reduce the dimensionality of the data using PCA on the features we have (without the critical score).
reduced_data_2d = PCA(n_components=2).fit_transform(scores_cluster)
reduced_data_3d = PCA(n_components=3).fit_transform(scores_cluster)
After that, we perform clustering on the reduced data. This will show us whether establishments with similar scores cluster together.
kmeans_2d = KMeans(n_clusters=8).fit(reduced_data_2d)
y_kmeans_2d = kmeans_2d.predict(reduced_data_2d)
kmeans_3d = KMeans(n_clusters=8).fit(reduced_data_3d)
y_kmeans_3d = kmeans_3d.predict(reduced_data_3d)
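Beyond the visual inspection below, one quick sanity check is to compare the mean critical score within each cluster: well-separated per-cluster means suggest the clusters line up with the score. A minimal sketch of the mechanics on synthetic data (standing in for scores_cluster and the real score):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for scores_cluster: 7 features in [0, 1]
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 7))
critic_score = 1 - X.mean(axis=1)  # stand-in for the weighted score

# Same pipeline as above: PCA to 2-D, then K-Means with 8 clusters
reduced = PCA(n_components=2).fit_transform(X)
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(reduced)

# Mean critical score within each cluster
per_cluster = pd.Series(critic_score).groupby(clusters).mean()
print(per_cluster.sort_values())
```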
labels = scores['Critics Score']
fig = go.Figure()
fig.update_layout(
autosize=False,
width=1000,
height=1000)
fig.add_trace(go.Scatter(
x=reduced_data_2d[:, 0],
y=reduced_data_2d[:, 1],
mode="markers",
hovertext=labels,
marker_color=y_kmeans_2d,
))
fig.update_layout(title_text="PCA plot with K-Means clustering (2D)")
fig.show()
labels = scores['Critics Score']
fig = go.Figure()
fig.update_layout(
autosize=False,
width=1000,
height=1000)
fig.add_trace(go.Scatter3d(
x=reduced_data_3d[:, 0],
y=reduced_data_3d[:, 1],
z=reduced_data_3d[:, 2],
mode="markers",
hovertext=labels,
marker_color=y_kmeans_3d,
))
fig.update_layout(title_text="PCA plot with K-Means clustering (3D)")
fig.show()
We can see a general trend in the data: the critical score decreases toward the right side of the plot, while the highest critical scores are concentrated on the left. Moreover, the low scores are clustered in a single group on the right, whereas the high scores form several clusters on the left.
We can hence see that the score we compute does in fact reflect the violations committed by the facilities.
The user score is computed in the FeatureForSafetyScore.ipynb notebook from the Yelp user rating; we will use it as the user score.